Abstract. The theory of splicing is an abstract model of the recombinant behaviour of DNA. In a splicing
system, the two strings to be spliced are taken from the same set and the splicing rule is taken from another
set. Here we propose a generalised splicing (GS) model with three components: two strings from two languages
and a splicing rule from a third component. We also propose a generalised self assembly (GSA) of strings: two
strings $u_1xv_1$ and $u_2xv_2$ self assemble over $x$ and generate $u_1xv_2$ and $u_2xv_1$. We study the
relationship between GS and GSA, and we study some classes of generalised splicing languages with the help of
generalised self assembly.

1 Introduction 

Tom Head proposed [5] an operation called 'splicing' for describing the recombination of DNA sequences under
the application of restriction enzymes and ligases. Given two strings $u\alpha\beta v$ and $u'\alpha'\beta'v'$
over some alphabet $V$ and a splicing rule $\alpha\#\beta\$\alpha'\#\beta'$, two strings $u\alpha\beta'v'$ and
$u'\alpha'\beta v$ are produced. The splicing rule $\alpha\#\beta\$\alpha'\#\beta'$ means that the first string
is cut between $\alpha$ and $\beta$, the second string is cut between $\alpha'$ and $\beta'$, and the
fragments recombine crosswise.
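
The crosswise recombination just described can be sketched in a few lines of Python. This is a minimal illustration, not part of the formal development: the function name `splice` and the encoding of a rule $\alpha\#\beta\$\alpha'\#\beta'$ as a 4-tuple of strings are assumptions made here for the example.

```python
def splice(x, y, rule):
    """Apply one splicing rule to the strings x and y.

    `rule` is the tuple (alpha, beta, alpha2, beta2), standing for the
    rule alpha#beta$alpha2#beta2 (an encoding assumed for this sketch).
    Returns every word obtained by cutting x between alpha and beta,
    cutting y between alpha2 and beta2, and recombining crosswise.
    """
    alpha, beta, alpha2, beta2 = rule
    site1, site2 = alpha + beta, alpha2 + beta2
    results = set()
    # Every occurrence of alpha.beta in x is a possible cutting site.
    for i in range(len(x) - len(site1) + 1):
        if not x.startswith(site1, i):
            continue
        cut_x = i + len(alpha)
        # Every occurrence of alpha2.beta2 in y is a possible cutting site.
        for j in range(len(y) - len(site2) + 1):
            if not y.startswith(site2, j):
                continue
            cut_y = j + len(alpha2)
            results.add(x[:cut_x] + y[cut_y:])  # u alpha beta' v'
            results.add(y[:cut_y] + x[cut_x:])  # u' alpha' beta v
    return results


# Cutting "abcd" and "xbcy" between b and c swaps the two tails.
print(splice("abcd", "xbcy", ("b", "c", "b", "c")))
```

For these inputs the two words produced are `abcy` and `xbcd`; if either site does not occur, the result is the empty set.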

The splicing scheme (also written as H-scheme) is a pair $\sigma = (V, R)$, where $V$ is an alphabet and
$R \subseteq V^*\#V^*\$V^*\#V^*$ is the set of splicing rules. Starting from a language, we generate a new
language by the iterated application of the splicing rules in $R$. Here $R$ can be infinite. Thus $R$ can be
considered as a language over $V \cup \{\#, \$\}$. A splicing language (a language generated by splicing)
depends upon the class of the language (in the Chomskian hierarchy) to be spliced and the type of the splicing
rules to be applied. The class of splicing languages $H(FL_1, FL_2)$ is the set of strings generated by taking
any two strings from $FL_1$ and splicing them by the strings of $FL_2$. $FL_1$ and $FL_2$ can be any classes of
languages in the Chomskian hierarchy. Detailed investigations of the computational power of splicing can be
found in [9].

The theory of splicing is an abstract model of the recombinant behaviour of DNA. In a splicing system, the
two strings to be spliced are taken from the same set and the splicing rule is taken from another set. The two
strings are taken from the same set because, in DNA recombination, both of the objects to be spliced are DNA
molecules. For example, a splicing language in the class $H(FIN, REG)$ is a language generated by taking two
strings from a finite language and using strings from a regular language as the splicing rules. Any general
'cut' and 'connection' model should include the cutting of two strings taken from two different languages. The
strings spliced and the splicing rules both have an effect on the language generated by the splicing process.
In short, we view a splicing model as having three components: two strings from two languages as the first two
components, and a splicing rule as the third component. Our proposal of a generalised splicing model (a formal
definition of GS, generalised splicing, is given in Definition 1 of Section 2) is:

$$GS(L_1, L_2, L_3) := \{z_1, z_2 : (x, y) \vdash_r (z_1, z_2),\ x \in L_1,\ y \in L_2,\ r \in L_3\}.$$

Instead of taking two strings from the same language, as is done in the theory of splicing, we take them from
two different languages. We cut them by using rules from a third language. This means that, taking an arbitrary
word $w_1 \in L_1$ and an arbitrary word $w_2 \in L_2$, we cut them by using an arbitrary rule of $L_3$. If
$L_1 = L_2$ in the generalised splicing model, we get the usual H-system.

The motivation for the above proposal of a generalised theory of splicing comes from the self assembly of
strings [4]. Two strings $uv$ and $vw$ self assemble over $v$ and generate $uvw$. Here, the overlapping string
appears at the end of one string and at the beginning of the other. Then comes the question: what is the
generalisation if we do not restrict the overlapping string to be at the end (or the beginning) of the strings
that participate in the assembling process? As an answer to this question, we propose a generalised self
assembly (GSA) of two strings (Definition 2). Two strings $w_1 = u_1xv_1$ and $w_2 = u_2xv_2$ self assemble over
the sub-string $x$ and generate the strings $u_1xv_2$ and $u_2xv_1$, as illustrated in the right hand side of
Figure 1. The generated words indicate that the $x$-self assembly of $w_1$ and $w_2$ (self assembly with $x$ as
the overlapping string) is just a generalised splicing of $w_1$ and $w_2$ with a splicing rule $x\#\$x\#$.

Fig. 1. Equivalence of generalised splicing and self assembly


We take advantage of this equivalence of GS and GSA and investigate generalised splicing for some classes of
languages in the Chomskian hierarchy. Since an investigation of the classes of languages under the generalised
splicing model is more complicated than for the existing H-system, we narrow down the investigation by taking
$L_3 = V^+ \cup \{(w_1, w_2) : w_1 \in L_1, w_2 \in L_2\}$ ($V$ is the set of common symbols that appear in
$L_1$ and $L_2$), which constitutes the set of splicing rules: a word $w \in V^+$ indicates that the splicing
rule is $w\#\$w\#$, where $\#$ and $\$$ have the usual meanings as in an H-system; a pair of words
$(w_1, w_2) \in L_3$ indicates that the splicing rule is of the form $w_1\#\$w_2\#$. The very purpose of
including the pair $(w_1, w_2)$ in $L_3$ is to include the words that are being spliced in the set of words
generated by the GS. The necessity of including the parent words is discussed at the end of Section 2.

Though the whole theory of splicing can be rewritten with the generalised splicing system, in this paper we
investigate $GS(L_1, L_2, L_3)$ for $L_1, L_2 \in \{REG, LIN, CF\}$, with $L_3$ as given in the previous
paragraph. For this investigation, we define the GSA of automata, regular grammars, linear grammars and context
free grammars (apart from the GSA of two languages). Section 2 gives the definitions of GS and GSA. The
subsequent sections discuss the generalised self assembly of finite languages, regular languages, linear
languages and context free languages.



2 Definitions 

Throughout this paper, we follow the terminologies and the notations as in [2], [9]. 

Definition 1 (Generalised splicing scheme) A generalised splicing scheme is defined as a triplet
$\sigma_g = (V_1, V_2, R)$, where $V_1$, $V_2$ are alphabets and $R \subseteq V_1^*\#V_1^*\$V_2^*\#V_2^*$. Here
$R$ can be infinite, and $R$ is considered as a set of strings, hence a language. For a given $\sigma_g$ and
languages $L_1 \subseteq V_1^*$ and $L_2 \subseteq V_2^*$, we define

$$\sigma_g(L_1, L_2) = \{z_1, z_2 : (x, y) \vdash_r (z_1, z_2), \text{ for } x \in L_1,\ y \in L_2,\ r \in R\}.$$

Given three families $FL_1$, $FL_2$, $FL_3$, we define

$$GS(FL_1, FL_2, FL_3) = \{\sigma_g(L_1, L_2) : L_1 \in FL_1,\ L_2 \in FL_2,\ R \in FL_3\},$$

i.e. $GS(FL_1, FL_2, FL_3)$ is the set of strings generated by splicing a language of $FL_1$ and a language of
$FL_2$ by using a set of splicing rules in $FL_3$.

Note 1. Whenever we refer to 'generalised splicing', we mean generalised 2-splicing.

Definition 2 (Generalised self assembly) Let $w_1 \in L_1$, $w_2 \in L_2$ be any two words. The generalised
$x$-self assembly operation $GSA_x(w_1, w_2)$ over $x \in sub(w_1) \cap sub(w_2)$, $x \neq \epsilon$, is defined
as follows:

$$GSA_x(w_1, w_2) = \{u_1xv_1,\ u_2xv_2,\ u_1xv_2,\ u_2xv_1 : w_1 = u_1xv_1,\ w_2 = u_2xv_2\}.$$



The self assembled words are the words that are generated when we trace from a left corner to a right corner
in Figure 2. Given any two languages $L_1$ and $L_2$ over the alphabets $V_1$ and $V_2$ respectively, we define

$$GSA(w_1, w_2) := \bigcup_{x} GSA_x(w_1, w_2)$$

and

$$GSA(L_1, L_2) := \bigcup_{w_1 \in L_1,\ w_2 \in L_2} GSA(w_1, w_2).$$

Fig. 2. Super impose over the common sub-string $x$ (labels: $w_1 = u_1xv_1$, $w_2 = u_2xv_2$, overlapping
string $x \in \Sigma^+$)



Though a self assembly process will not include the parent words $w_1$ and $w_2$ (when $w_1 \neq w_2$), in the
above definition we purposefully include the parent words for the sake of clarity in studying GS through the
GSA approach, i.e. we plan to investigate $GS(L_1, L_2, L_3)$ where
$L_3 = V^+ \cup \{(w_1, w_2) : w_1 \in L_1, w_2 \in L_2\}$ ($V$ is the set of common symbols that appear in
$L_1$ and $L_2$). The pair $(w_1, w_2)$ in the set of splicing rules means that $w_1$ will be cut after $w_1$
and $w_2$ will be cut after $w_2$. Note that the parent words $w_1$ and $w_2$ are included in $GS(w_1, w_2)$.
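
For finite languages, Definition 2 can be computed directly. The following Python sketch is only an illustration under stated assumptions: the function names (`gsa_x`, `gsa_words`, `gsa_languages`) are chosen here, languages are represented as Python sets of strings, and the union over $x$ is read literally, so two words with no common sub-string contribute nothing.

```python
def substrings(word):
    """All non-empty substrings of a word."""
    return {word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)}


def occurrences(word, x):
    """Start positions of every occurrence of x in word."""
    return [i for i in range(len(word) - len(x) + 1) if word.startswith(x, i)]


def gsa_x(w1, w2, x):
    """GSA_x(w1, w2): overlap w1 and w2 on the common substring x."""
    result = set()
    for i in occurrences(w1, x):
        u1x, v1 = w1[:i + len(x)], w1[i + len(x):]
        for j in occurrences(w2, x):
            u2x, v2 = w2[:j + len(x)], w2[j + len(x):]
            # Parent words and the two crossed words, as in Definition 2.
            result |= {u1x + v1, u2x + v2, u1x + v2, u2x + v1}
    return result


def gsa_words(w1, w2):
    """GSA(w1, w2): union of GSA_x over all common non-empty substrings x."""
    result = set()
    for x in substrings(w1) & substrings(w2):
        result |= gsa_x(w1, w2, x)
    return result


def gsa_languages(l1, l2):
    """GSA(L1, L2) for two finite languages given as sets of words."""
    result = set()
    for w1 in l1:
        for w2 in l2:
            result |= gsa_words(w1, w2)
    return result


print(sorted(gsa_words("abc", "dbe")))
```

Here `abc` and `dbe` overlap only on `b`, so the call yields the two parent words together with `abe` and `dbc`.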

With the motivation given in section 1 and with the above two definitions, we have the following theorem. 

Theorem 1. Let $L_1$ and $L_2$ be any two languages. Let $V = V_{L_1} \cap V_{L_2}$, where $V_{L_1}$ and
$V_{L_2}$ are the alphabets of $L_1$ and $L_2$ respectively. Then

$$GS(L_1, L_2, R) = GSA(L_1, L_2),$$

where

$$R = V^+ \cup \{(w_1, w_2) : w_1 \in L_1,\ w_2 \in L_2\}.$$
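
The statement can be checked by brute force on a small instance. The Python sketch below is an illustration only, not a proof: the helper names (`cuts_after`, `cross`, `gs`, `gsa`) and the example languages are assumptions made here, and the example is hand-picked so that no word occurs inside another word of the same language. A word $x \in V^+$ is read as the rule $x\#\$x\#$ and a pair $(w_1, w_2)$ as the rule $w_1\#\$w_2\#$.

```python
from itertools import product


def cuts_after(word, site):
    """All ways to cut `word` immediately after an occurrence of `site`."""
    k = len(site)
    return [(word[:i + k], word[i + k:])
            for i in range(len(word) - k + 1) if word.startswith(site, i)]


def cross(w1, w2, site1, site2):
    """Words produced by the rule site1#$site2# applied to (w1, w2)."""
    out = set()
    for (a, b), (c, d) in product(cuts_after(w1, site1), cuts_after(w2, site2)):
        out |= {a + d, c + b}
    return out


def factors(word):
    """All non-empty substrings of a word."""
    return {word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)}


def gs(l1, l2):
    """GS(L1, L2, R) for finite L1, L2, with R built as in the theorem."""
    common = set("".join(l1)) & set("".join(l2))  # the alphabet V
    out = set()
    for w1, w2 in product(l1, l2):
        # Rules from V+: only substrings of w1 over V can ever apply.
        for x in factors(w1):
            if set(x) <= common:
                out |= cross(w1, w2, x, x)
    # Pair rules (p1, p2): cut after p1 in the first word, after p2 in the second.
    for (p1, p2), (w1, w2) in product(product(l1, l2), product(l1, l2)):
        out |= cross(w1, w2, p1, p2)
    return out


def gsa(l1, l2):
    """GSA(L1, L2) computed directly from the definition of GSA_x."""
    out = set()
    for w1, w2 in product(l1, l2):
        for x in factors(w1):
            for (a, b), (c, d) in product(cuts_after(w1, x), cuts_after(w2, x)):
                out |= {w1, w2, a + d, c + b}
    return out


L1, L2 = {"abc", "cb"}, {"dbe"}
print(gs(L1, L2) == gsa(L1, L2))
```

On this instance both sides consist of the seven words `abc`, `cb`, `dbe`, `abe`, `dbc`, `cbe` and `db`.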



3 Generalised Self assembly of finite languages 

This is the simplest case. Suppose there are two finite languages $L_1$ and $L_2$, containing $n_1$ and $n_2$
words respectively. Given any two words $w_1 \in L_1$ and $w_2 \in L_2$, there are only finitely many common
sub-strings between them, so only finitely many new words can be generated by their self assembly; each
generated word has length at most $|w_1| + |w_2| - 1$. Since the parent languages are finite, there are only
$n_1 n_2$ pairs of words, and the end product $GSA(L_1, L_2)$ contains only a finite number of words. Thus we
get the following theorem:

Theorem 2. Self assembly of two finite languages is finite. So we may write,

GSA(FIN, FIN) = FIN.



4 Generalised Self assembly of regular languages 

In this section we investigate the behaviour of the self assembly of two regular languages. We know that
regular languages are generated by regular grammars and are also accepted by finite automata. We shall show
that the self assembly of any two regular languages is regular, using both the automata approach and the
grammar approach.



4.1 Generalised Self assembly of regular grammar 

In this section we describe, given any two regular grammars $G_1$, $G_2$ of languages $L_1$ and $L_2$
respectively, how to construct a grammar for the self assembly language $GSA(L_1, L_2)$.

Definition 3 (Self assembly of REG grammars) Let $G_i = (N_i, T_i, R_i, S_i)$, $i = 1, 2$, be regular grammars
of the languages $L_1 = L(G_1)$ and $L_2 = L(G_2)$, where the $N_i$ are the sets of non-terminals with
$N_1 \cap N_2 = \emptyset$, the $T_i$ are the sets of terminals with $T_1 \cap T_2 \neq \emptyset$ (only then can
we self assemble), the $S_i$ are the start symbols and the $R_i$ are the sets of rules.

The generalised self assembly of $G_1$ and $G_2$, written as $GSA(G_1, G_2)$, is defined as

$$G = (N_1 \cup N_2 \cup \{S\},\ T_1 \cup T_2,\ R,\ S), \quad S \notin N_1 \cup N_2,$$

where $R$ includes the following rules:

1. $S \to S_1$, $S \to S_2$.

2. All the rules of $R_1$ and $R_2$.

3. For $a \in T_1 \cap T_2$, for each pair of rules $A \to aB \in R_1$ and $A' \to aB' \in R_2$, include the rules
$A \to aB'$ and $A' \to aB$ in $R$.

Note 2. REG grammars are those whose rules are of the form $A \to aB$ or $A \to a$, where $A$ and $B$ are
non-terminals and $a$ is a terminal. The two rules can be jointly expressed as $A \to a\gamma$, where $\gamma$
is a non-terminal or $\gamma = \epsilon$.

Example 1 Let $G_1 = (\{S_1\}, \{a, b\}, R_1 = \{S_1 \to aS_1, S_1 \to b\}, S_1)$ and
$G_2 = (\{S_2\}, \{a, b\}, R_2 = \{S_2 \to bS_2, S_2 \to a\}, S_2)$. Then $L_1 = L(G_1) = a^*b$ and
$L_2 = L(G_2) = b^*a$. The GSA grammar is $G = (\{S, S_1, S_2\}, \{a, b\}, R, S)$, where the rules $R$ are given
as

$S \to S_1$, $\quad S_1 \to aS_1 \mid b \mid bS_2 \mid a$,

$S \to S_2$, $\quad S_2 \to aS_1 \mid bS_2 \mid a \mid b$.

Note that the language generated by $G$, $L(G)$, includes the languages $L(G_1)$ and $L(G_2)$. Thus the GSA of
two regular grammars is again regular. In the same spirit as the above definition, we define the GSA of linear
grammars and the GSA of context free grammars (for the latter, we consider the Greibach normal form for CFGs).
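
The rule construction of Definition 3 is mechanical, and Example 1 can be reproduced with a short Python sketch. The encoding is an assumption made for this illustration: a rule $A \to aB$ is the triple `(A, a, B)`, a rule $A \to a$ is `(A, a, None)`, and a unit rule $S \to S_i$ is `(S, "", S_i)`. The function names are likewise chosen here. The sketch only builds the rule set and tests whether individual words are derivable.

```python
def gsa_regular_grammar(rules1, start1, rules2, start2, new_start="S"):
    """Rule set of GSA(G1, G2) for two regular grammars.

    The non-terminal sets of the two grammars are assumed to be disjoint.
    """
    # Item 1: S -> S1 and S -> S2.
    rules = {(new_start, "", start1), (new_start, "", start2)}
    # Item 2: all rules of R1 and R2.
    rules |= set(rules1) | set(rules2)
    # Item 3: swap the right-hand tails of rules sharing a terminal.
    for (a1, t1, b1) in rules1:
        for (a2, t2, b2) in rules2:
            if t1 == t2:
                rules.add((a1, t1, b2))
                rules.add((a2, t2, b1))
    return rules


def derives(rules, start, word):
    """True if `word` is derivable from `start` in the regular grammar."""
    def close(symbols):
        # Follow unit rules; None marks a finished derivation.
        stack, seen = list(symbols), set(symbols)
        while stack:
            current = stack.pop()
            for (lhs, t, rhs) in rules:
                if lhs == current and t == "" and rhs not in seen:
                    seen.add(rhs)
                    stack.append(rhs)
        return seen

    live = close({start})
    for ch in word:
        live = close({rhs for (lhs, t, rhs) in rules if lhs in live and t == ch})
    return None in live


R1 = {("S1", "a", "S1"), ("S1", "b", None)}   # generates a*b
R2 = {("S2", "b", "S2"), ("S2", "a", None)}   # generates b*a
G = gsa_regular_grammar(R1, "S1", R2, "S2")
for rule in sorted(G, key=str):
    print(rule)
```

The printed rule set consists of the ten rules listed in Example 1: the two unit rules, the four original rules, and the four swapped rules $S_1 \to a$, $S_1 \to bS_2$, $S_2 \to aS_1$, $S_2 \to b$.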

Theorem 3. Let $G_1$ and $G_2$ be any two regular grammars. Then

$$L(GSA(G_1, G_2)) = GSA(L(G_1), L(G_2)).$$

Proof. Part I: Let $w \in GSA(L(G_1), L(G_2))$.

Case I: $w \in L(G_1)$ or $w \in L(G_2)$. This is trivial, since the rules $R_1$ and $R_2$ are included in
$GSA(G_1, G_2)$.

Case II: $w \notin L(G_1)$ and $w \notin L(G_2)$. There exist $w_1 \in L(G_1)$, $w_2 \in L(G_2)$ and
$a \in \Sigma_{w_1} \cap \Sigma_{w_2}$ such that $w = uav \in GSA(w_1, w_2)$, where $w_1 = uau_1$,
$w_2 = v_1av$, $u \in prefix(w_1)$, $v_1 \in prefix(w_2)$, $u_1 \in suffix(w_1)$, $v \in suffix(w_2)$.

Since $w_1 \in L(G_1)$, there exists a derivation

$$S_1 \Rightarrow^*_{G_1} uA \Rightarrow_{G_1} uaB \Rightarrow^*_{G_1} uau_1, \quad A \to aB \in R_1,$$

for deriving $w_1 = uau_1$. Similarly, there exists a derivation

$$S_2 \Rightarrow^*_{G_2} v_1A' \Rightarrow_{G_2} v_1aB' \Rightarrow^*_{G_2} v_1av, \quad A' \to aB' \in R_2,$$

for deriving $w_2 = v_1av$. Since $A \to aB \in R_1$ and $A' \to aB' \in R_2$ imply that
$A \to aB' \in R(GSA(G_1, G_2))$, we have the derivation

$$S \Rightarrow_{GSA(G_1,G_2)} S_1 \Rightarrow^*_{G_1} uA \Rightarrow_{GSA(G_1,G_2)} uaB' \Rightarrow^*_{G_2} uav,$$

i.e.

$$S \Rightarrow^*_{GSA(G_1,G_2)} uav = w.$$

Hence $w \in L(GSA(G_1, G_2))$, and so $GSA(L(G_1), L(G_2)) \subseteq L(GSA(G_1, G_2))$.

Part II: Let $w \in L(GSA(G_1, G_2))$. Without loss of generality, we assume that $w \notin L(G_1)$ and
$w \notin L(G_2)$. Since $w \in L(GSA(G_1, G_2))$, $w$ can be expressed as $w = uav$, so there exists a
derivation

$$S \Rightarrow_{GSA(G_1,G_2)} S_1 \Rightarrow^*_{G_1} uA \Rightarrow_{GSA(G_1,G_2)} uaB' \Rightarrow^*_{G_2} uav.$$

Since $A \to aB' \in R(GSA(G_1, G_2))$ but $A \to aB' \notin R_1, R_2$ (because $A, B \notin N_2$ and
$A', B' \notin N_1$), there exist productions of the type $A \to aB \in R_1$ and $A' \to aB' \in R_2$.

This implies

$$S_1 \Rightarrow^*_{G_1} uA \Rightarrow_{G_1} uaB \Rightarrow^*_{G_1} uax, \text{ using the production } A \to aB,$$

and

$$S_2 \Rightarrow^*_{G_2} yA' \Rightarrow_{G_2} yaB' \Rightarrow^*_{G_2} yav, \text{ using the production } A' \to aB'.$$

This gives $uax \in L(G_1)$ and $yav \in L(G_2)$, corresponding to $w = uav \in GSA(L(G_1), L(G_2))$.

Hence $L(GSA(G_1, G_2)) \subseteq GSA(L(G_1), L(G_2))$. Hence the result.

4.2 Generalised Self assembly of finite automata

If $L_1$ and $L_2$ are any two REG languages, there exist two finite automata $M_1$ and $M_2$ such that
$L_1 = L(M_1)$ and $L_2 = L(M_2)$. While $L_1$ and $L_2$ can self assemble by string overlapping, it is
interesting to explore whether the corresponding automata self assemble to an automaton $M$ such that the
language of the self assembled automaton is the same as the self assembly of the languages. If a word $w$ is
accepted by an FA, every symbol $a$ in $w$ corresponds to an edge labelled $a$ in the transition diagram of the
FA. This gives the idea that FAs can be self assembled by overlapping edges with the same label. Thus we have
the following definition:

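Definition 4 (Generalised self assembly of two FA's) Let $M_1 = (Q_1, V_1, \delta_1, q_1, F_1)$ and
$M_2 = (Q_2, V_2, \delta_2, q_2, F_2)$ be two machines such that $V_1 \cap V_2 \neq \emptyset$. The generalised
self assembly of $M_1$ and $M_2$, written as $GSA(M_1, M_2)$, is defined as

$$M = (Q = Q_1 \cup Q_2 \cup \{q_0\},\ V_1 \cup V_2 \cup \{\epsilon\},\ \delta,\ q_0,\ F_1 \cup F_2),$$

where $q_0 \notin Q_1 \cup Q_2$ is a new start state. $\delta$ is defined as follows:

1. $\delta(q_0, \epsilon) = \{q_1, q_2\}$.

2. For all $a \in V_1 \cup V_2$ and $q \in Q_1 \cup Q_2$,

$$\delta(q, a) = \begin{cases} \delta_1(q, a) & q \in Q_1 \\ \delta_2(q, a) & q \in Q_2 \end{cases}$$

3. For every pair of transitions $\delta_1(q_i, a) = q_j$ and $\delta_2(q'_i, a) = q'_j$, with $q_i \in Q_1$ and
$q'_i \in Q_2$, we include two new transition rules,

$$\delta(q_i, a) = q'_j, \qquad \delta(q'_i, a) = q_j.$$

Note that the language accepted by the GSA of $M_1$ and $M_2$ includes $L(M_1)$ and $L(M_2)$.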
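
The three items of Definition 4 translate directly into code. The Python sketch below is a minimal illustration under assumptions made here: a transition function is a dictionary from `(state, symbol)` to a set of states, the empty string `""` stands for an $\epsilon$-move, and the names `gsa_automaton` and `accepts` are not from the paper. The example machines accept $a^*b$ and $b^*a$.

```python
def gsa_automaton(delta1, start1, finals1, delta2, start2, finals2, new_start="q0"):
    """Build GSA(M1, M2); the state names of M1 and M2 must be disjoint."""
    # Item 1: an empty-string move from the new start state to both old ones.
    delta = {(new_start, ""): {start1, start2}}
    # Item 2: keep every transition of M1 and M2.
    for source in (delta1, delta2):
        for key, targets in source.items():
            delta.setdefault(key, set()).update(targets)
    # Item 3: for edges with the same label, add the two crossed transitions.
    for (p, a), targets1 in delta1.items():
        for (q, b), targets2 in delta2.items():
            if a == b and a != "":
                delta[(p, a)].update(targets2)
                delta[(q, b)].update(targets1)
    return delta, new_start, set(finals1) | set(finals2)


def accepts(delta, start, finals, word):
    """Simulate the resulting non-deterministic automaton on `word`."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for nxt in delta.get((stack.pop(), ""), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    current = closure({start})
    for ch in word:
        step = set()
        for state in current:
            step |= delta.get((state, ch), set())
        current = closure(step)
    return bool(current & finals)


# M1 accepts a*b, M2 accepts b*a (hypothetical state names).
D1 = {("p0", "a"): {"p0"}, ("p0", "b"): {"p1"}}
D2 = {("r0", "b"): {"r0"}, ("r0", "a"): {"r1"}}
M = gsa_automaton(D1, "p0", {"p1"}, D2, "r0", {"r1"})
print(accepts(*M, "aab"), accepts(*M, "bba"), accepts(*M, "aa"))
```

The three test words are accepted: `aab` and `bba` belong to the parent languages, and `aa` is obtained by overlapping `aab` and `a` on the symbol `a`.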



It is observed that when $G_1$ and $G_2$ are regular grammars, we have

$$L(GSA(G_1, G_2)) = L(GSA(M_1, M_2)),$$

where $L(G_1) = L(M_1)$ and $L(G_2) = L(M_2)$.

The idea behind the self assembly of two FAs is the overlapping of directed edges labelled with the same
symbol in the transition diagrams of both finite automata. Every transition rule corresponds to a directed
edge in the transition diagram. Let $\delta_1(q_i, a) = q_j$ and $\delta_2(q'_i, a) = q'_j$ be transitions in
$M_1$ and $M_2$ respectively. In the self assembly of $M_1$ and $M_2$, the directed edges in the transition
diagrams that correspond to the above transitions overlap: when the edges overlap, the states $q_i$ and $q'_i$
overlap. To add more clarity, Figure 3 is drawn in a way that all the transitions are preserved.

Fig. 3. Part of self assembled finite automata (panels: Machine $M_1$, Machine $M_2$, Machine $M$). Machines
$M_1$ and $M_2$ are self assembled at the transition edge $a$. The new FA $M$ is drawn to specifically
highlight the assembled states.

Theorem 4. Let $M_1$ and $M_2$ be any two finite automata. Then

$$L(GSA(M_1, M_2)) = GSA(L(M_1), L(M_2)).$$

Proof. Without loss of generality, we assume that there is only one directed edge labelled $a$ in the
transition diagrams of $M_1$ and $M_2$ which can overlap. Further, we can assume that all states of $M_1$ and
$M_2$ are differently labelled.

Part I: Let $w \in L(GSA(M_1, M_2))$.

Case I: $w \in L(M_1)$ or $w \in L(M_2)$. Since $L(M_1), L(M_2) \subseteq GSA(L(M_1), L(M_2))$, we have
$w \in GSA(L(M_1), L(M_2))$.

Case II: $w \notin L(M_1)$ and $w \notin L(M_2)$. There exists a path from $q_0$ to one of the final states
involving the edge $a$ in the transition graph of $M$ such that the path preceding the edge $a$ is in $M_1$ (or
in $M_2$), and the path succeeding the edge $a$ is in $M_2$ (or in $M_1$).

$\Rightarrow$ $w = w_1aw_2$, where $w_1 \in prefix(x)$ for some $x \in L(M_1)$ (or $x \in L(M_2)$) and
$w_2 \in suffix(y)$ for some $y \in L(M_2)$ (or $y \in L(M_1)$); i.e. $w_1$ is the label of the path in $M_1$
(or in $M_2$), and $w_2$ is the label of the path in $M_2$ (or in $M_1$).

$\Rightarrow$ $w$ can be written as the self assembly of the words $w_1aw'_1$ and $w'_2aw_2$, where
$w_1aw'_1 \in L(M_1)$ and $w'_2aw_2 \in L(M_2)$.

$\Rightarrow$ $w \in GSA(L(M_1), L(M_2))$.

Hence $L(GSA(M_1, M_2)) \subseteq GSA(L(M_1), L(M_2))$.

Part II: Let $w \in GSA(L(M_1), L(M_2))$.

$\Rightarrow$ $w \in GSA(x, y)$ for some $x \in L(M_1)$, $y \in L(M_2)$.

$\Rightarrow$ $w = w_1aw'_2$ or $w = w_2aw'_1$, where $x = w_1aw'_1$, $y = w_2aw'_2$.

$\Rightarrow$ There exists a path with label $w$ from $q_0$ to one of the final states in the transition graph
of $M$, involving the edge $a$.

$\Rightarrow$ $w \in L(GSA(M_1, M_2))$.

Hence the result.

Combining the results above we get the following theorem. 

Theorem 5. Generalised self assembly of two regular languages is regular. So we may write, 

GSA(REG, REG) = REG. 

We may also go a step further. For any $L_1 \in FIN$ we can construct an automaton $M_1$ in the following way:
for each word, make an automaton which accepts only that word. Together these form a single finite automaton
with a unique start state, which on the empty string $\epsilon$ links to each of the individual automata. Now,
given a regular language $L_2 \in REG$, we have an automaton $M_2$ accepting it. We can self assemble $M_1$ and
$M_2$ by the method of Definition 4. The result is again a finite automaton. Since
$L_2 \subseteq GSA(L_1, L_2)$ by our construction, this automaton also accepts an infinite number of words
whenever $L_2$ is infinite. We can summarise this as:

Theorem 6. Self assembly of regular and finite languages is regular. So we may write, 

GSA(FIN, REG) = REG. 
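
The automaton for a finite language used in the argument above can be sketched as follows. This is an illustrative Python construction under assumptions made here: each word gets its own chain of states named `(k, i)`, a single start state `q0` reaches every chain by an empty-string move (written `""`), and the function names are hypothetical.

```python
def finite_language_nfa(words, start="q0"):
    """Automaton accepting exactly the given finite set of words."""
    delta, finals = {}, set()
    for k, word in enumerate(sorted(words)):
        # Empty-string move from the common start state into chain k.
        delta.setdefault((start, ""), set()).add((k, 0))
        # One chain of states (k, 0), (k, 1), ..., (k, len(word)) per word.
        for i, ch in enumerate(word):
            delta[((k, i), ch)] = {(k, i + 1)}
        finals.add((k, len(word)))
    return delta, start, finals


def accepts(delta, start, finals, word):
    """Run the automaton; only the start state has empty-string moves here."""
    current = {start} | delta.get((start, ""), set())
    for ch in word:
        current = set().union(*(delta.get((s, ch), set()) for s in current))
    return bool(current & finals)


M1 = finite_language_nfa({"ab", "c"})
print(accepts(*M1, "ab"), accepts(*M1, "c"), accepts(*M1, "a"))
```

The first two words are accepted and the third is rejected, since `a` is only a proper prefix of a word of the language.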

5 Generalised Self assembly of linear languages 

Linear languages (written as LIN) are the ones which are characterised by the following grammar rules:

$$X \to P_1YP_2, \qquad X \to P_3, \qquad (1)$$

where $X$ and $Y$ are non-terminals ($N$), and $P_1$, $P_2$, $P_3$ are words over terminals ($T$) [2]. If $P_1$
(resp. $P_2$) is $\epsilon$ the grammar is called left-linear (resp. right-linear). Any linear language can be
generated by right (or left) linear grammar. Also they are equivalent [2]. Hence for our purpose we convert all
the grammars to right-linear form only, i.e. we consider only rules of the form

$$X \to P_1Y, \qquad X \to P_2,$$

where $P_1, P_2 \in T^+$. Again, we may further introduce new non-terminals, such that each rule is of one of
the forms

$$X \to aY, \qquad X \to a, \qquad (2)$$

where $a \in T \cup \{\epsilon\}$ and $Y \in N^+$.

Method for self assembly of LIN grammar: 

We now use a process similar to that given in Definition 3. Suppose we have $L_1, L_2 \in LIN$. We construct
grammars $G_i = (N_i, T_i, S_i, R_i)$, $i = 1, 2$, for them such that $N_1 \cap N_2 = \emptyset$ and the rules
of the $R_i$ are of the form of equation (2). Define a grammar $G = (N, T, S, R)$, where
$N = N_1 \cup N_2 \cup \{S\}$, $T = T_1 \cup T_2$, $S$ is the new start symbol, and the rules of $R$ are:

1. $S \to S_1$, $S \to S_2$.

2. All the rules of $R_1$ and $R_2$.

3. For $a \in T_1 \cap T_2$, for each pair of rules $A_1 \to a\gamma_1 \in R_1$ and $A_2 \to a\gamma_2 \in R_2$,
include the rules $A_1 \to a\gamma_2$ and $A_2 \to a\gamma_1$ in $R$, where $\gamma_1 \in N_1^*$ and
$\gamma_2 \in N_2^*$.

The analogue of Theorem 3 follows by the same line of argument. Thus we can also conclude that:

Theorem 7. Self assembly of two linear languages is linear; i.e. 

GSA(LIN, LIN) = LIN. 

6 Generalised Self assembly of context free languages 

We self assemble CF grammars, and thus show that the self assembly of two CF languages is again a CF
language. Instead of using general grammar rules, we take the help of the Greibach normal form [6]. To use
this, we can assume without loss of generality that the parent languages are $\epsilon$-free. In Greibach
normal form each rule is of the form $A \to a\gamma$, where $\gamma \in N^*$. We use exactly the same method as
for linear grammars. The same line of argument gives us:

Theorem 8. Generalised self assembly of two context free languages is context free; i.e. 

GSA(CF, CF) = CF. 

7 Conclusion 

In all definitions of the GSAs of languages, grammars (Definition 3) and FAs (Definition 4), the parent words
are included in the words generated by the GSA. In fact, in any self assembly process of $w_1$ and $w_2$, the
words $w_1$ and $w_2$ will be generated only when $w_1 = w_2$. But in our definition of GSA, we prefer to
include $w_1$ and $w_2$ (even if $w_1 \neq w_2$) in $GSA(w_1, w_2)$ on purpose. Though we can define the GSA of
grammars (as well as FAs) so that the parent words are not included in the words generated, the process would
be highly complicated. The main purpose of this paper is to study generalised splicing through the self
assembly approach. For the sake of not losing clarity of our approach in this study, we prefer to include the
parent words in all our definitions, namely GS of languages, GSA of languages, and GSA of grammars.

Thus, we have proved that $GS(FIN, FIN, R) = FIN$, $GS(REG, REG, R) = REG$, $GS(FIN, REG, R) = REG$,
$GS(LIN, LIN, R) = LIN$ and $GS(CF, CF, R) = CF$, where $R$ is as given in Theorem 1. This study can be
extended to other generalised splicing classes of languages.